Improved Sample Complexity for Multiclass PAC Learning

Neural Information Processing Systems

We aim to understand the optimal PAC sample complexity in multiclass learning. While finiteness of the Daniely-Shalev-Shwartz (DS) dimension has been shown to characterize the PAC learnability of a concept class [Brukhim, Carmon, Dinur, Moran, and Yehudayoff, 2022], there remain polylog factor gaps in the leading term of the sample complexity. In this paper, we reduce the gap in the dependence on the error parameter to a single log factor and also propose two possible routes toward completely resolving the optimal sample complexity, each based on a key open question we formulate: one concerning list learning with bounded list size, the other concerning a new type of shifting for multiclass concept classes. We prove that a positive answer to either question would completely resolve the optimal sample complexity up to log factors of the DS dimension.
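
To make the "polylog factor gaps" concrete, here is a schematic comparison in LaTeX. This is a sketch only: $d$ denotes the DS dimension, and the displayed polylog factors are illustrative placeholders, not the paper's exact bounds.

    % Schematic only: d = DS dimension, epsilon = error, delta = confidence.
    \[
      \underbrace{\Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right)}_{\text{lower bound}}
      \;\le\; m(\epsilon,\delta) \;\le\;
      \underbrace{O\!\left(\frac{d \cdot \mathrm{polylog}(1/\epsilon) + \log(1/\delta)}{\epsilon}\right)}_{\text{prior upper bound}}
    \]
    % Per the abstract, this paper reduces the polylog(1/epsilon) factor to a
    % single log(1/epsilon); under either of the two open questions, only log
    % factors of d would remain.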


Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Neural Information Processing Systems

We study the problem of exploring an unknown environment when no reward function is provided to the agent. Building on the incremental exploration setting introduced by Lim and Auer (2012), we define the objective of learning the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we introduce a novel model-based approach that interleaves discovering new states from $s_0$ and improving the accuracy of a model estimate that is used to compute goal-conditioned policies. The resulting algorithm, DisCo, achieves a sample complexity scaling as $\widetilde{O}_{\epsilon}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$, where $A$ is the number of actions, $S_{L+\epsilon}$ is the number of states that are incrementally reachable from $s_0$ in $L+\epsilon$ steps, and $\Gamma_{L+\epsilon}$ is the branching factor of the dynamics over such states. This improves over the algorithm proposed by Lim and Auer (2012) in both $\epsilon$ and $L$, at the cost of an extra $\Gamma_{L+\epsilon}$ factor, which is small in most environments of interest.
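
To make the interleaving concrete, below is a minimal Python schematic of the discover/estimate loop the abstract describes. It is a sketch under stated assumptions, not the DisCo algorithm itself: the `env` interface (num_states, num_actions, reset, step), the count-based model estimate with a uniform fallback, and the rollout budget are all illustrative assumptions.

    import numpy as np

    # Minimal schematic of interleaved exploration (NOT the authors' DisCo
    # algorithm): maintain a set of "known" states reachable from s0, estimate
    # transitions from visit counts, and use goal-conditioned planning on the
    # estimated model to decide which new states look L-reachable.

    def goal_policy(P_hat, goal, horizon):
        # Value iteration for expected steps-to-goal under the estimated model.
        S, A, _ = P_hat.shape
        V = np.full(S, float(horizon))
        V[goal] = 0.0
        for _ in range(horizon):
            Q = 1.0 + P_hat @ V                  # Q[s, a] = 1 + E[V(s') | s, a]
            Q[goal, :] = 0.0
            V = np.minimum(Q.min(axis=1), horizon)
        return Q.argmin(axis=1), V

    def incremental_explore(env, s0, L, rollouts=50):
        S, A = env.num_states, env.num_actions
        counts = np.zeros((S, A, S))
        known = {s0}
        P_hat = np.full((S, A, S), 1.0 / S)
        for _ in range(S):                       # at most S discovery rounds
            totals = counts.sum(axis=2, keepdims=True)
            P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
            grew = False
            for g in sorted(set(range(S)) - known):
                pi, V = goal_policy(P_hat, g, horizon=2 * L)
                if V[s0] > L:                    # model says g not yet L-reachable
                    continue
                known.add(g)
                grew = True
                for _ in range(rollouts):        # collect transitions en route to g
                    s = env.reset(s0)
                    for _ in range(2 * L):
                        a = int(pi[s])
                        s_next = env.step(a)
                        counts[s, a, s_next] += 1
                        s = s_next
            if not grew:                         # no new state discovered: stop
                break
        return known, P_hat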


Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes

Mondal, Washim Uddin, Aggarwal, Vaneet

arXiv.org Artificial Intelligence

We consider the problem of designing sample-efficient learning algorithms for infinite horizon discounted reward Markov Decision Processes. Specifically, we propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which utilizes an accelerated stochastic gradient descent process to obtain the natural policy gradient. ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity with general parameterization, where $\epsilon$ defines the optimality error. This improves the state-of-the-art sample complexity by a $\log(\frac{1}{\epsilon})$ factor. ANPG is a first-order algorithm and, unlike some existing literature, does not require the unverifiable assumption that the variance of importance sampling (IS) weights is upper bounded. In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
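
As a rough illustration of the mechanism the abstract describes: the natural gradient direction $w^* = F(\theta)^{-1} g$ minimizes the quadratic $\frac{1}{2} w^\top F w - w^\top g$, so it can be approximated by an accelerated stochastic gradient loop rather than a matrix inversion. The Python sketch below is schematic and not the paper's exact procedure; `fisher_vec` and `pg_sample` are assumed stochastic oracles (e.g., single-trajectory estimates), and the Nesterov-style loop stands in for the paper's accelerated SGD subroutine.

    import numpy as np

    def npg_direction(fisher_vec, pg_sample, dim, iters=200, lr=0.05, beta=0.9):
        # Stochastically solve min_w 0.5 w^T F w - w^T g; the minimizer
        # w* = F^{-1} g is the natural policy gradient direction.
        w = np.zeros(dim)
        v = np.zeros(dim)                          # momentum buffer
        for _ in range(iters):
            look = w + beta * v                    # Nesterov-style look-ahead
            grad = fisher_vec(look) - pg_sample()  # stochastic quadratic gradient
            v = beta * v - lr * grad
            w = w + v
        return w                                   # approximates F^{-1} g

    def anpg(theta, fisher_vec_at, pg_sample_at, T, step=0.01):
        # Outer loop (schematic): re-solve the subproblem at each policy iterate.
        for _ in range(T):
            w = npg_direction(fisher_vec_at(theta), pg_sample_at(theta), theta.size)
            theta = theta + step * w
        return theta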


Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies

Fatkhullin, Ilyas, Barakat, Anas, Kireeva, Anastasia, He, Niao

arXiv.org Artificial Intelligence

Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the considerable effort directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies, which allows us to address the case of continuous state-action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: they are single-loop, do not require large batches of trajectories, and sample at most two trajectories per iteration; and $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of the parameters.
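
To illustrate the single-trajectory, single-loop character the abstract emphasizes, here is a hedged Python sketch of a normalized policy-gradient step with an implicit-gradient-transport estimator, in the spirit of N-PG-IGT. The averaging and step-size schedules and the `grad_est` oracle (one sampled trajectory per call) are illustrative assumptions, not the paper's exact choices.

    import numpy as np

    def n_pg_igt(theta0, grad_est, T, gamma0=0.1):
        # Sketch only: normalized PG ascent with implicit gradient transport.
        theta_prev = theta = np.asarray(theta0, dtype=float)
        d = np.zeros_like(theta)
        for t in range(1, T + 1):
            eta = 2.0 / (t + 1)                  # assumed averaging weight
            # Extrapolated query point: evaluating the stochastic gradient here
            # "transports" past information, so one trajectory per step suffices.
            query = theta + ((1.0 - eta) / eta) * (theta - theta_prev)
            d = (1.0 - eta) * d + eta * grad_est(query)   # one trajectory per call
            theta_prev = theta
            step = gamma0 / t ** 0.6             # assumed decaying step size
            theta = theta + step * d / (np.linalg.norm(d) + 1e-12)  # normalized step
        return theta

The normalization of the update direction is what removes the need for large batches or importance-sampling corrections in this family of methods; the (N)-HARPG variant described above would additionally refresh `d` with a Hessian-vector-product correction, which is omitted here.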